Introduction#

Modeling a complex service is a time-consuming process that may require many rounds of fine-tuning. In this lesson, we’ll discuss how we can achieve the non-functional requirements, especially real-time communication, and estimate the response time of our proposed Zoom meeting API.

Non-functional requirements#

Let's discuss the non-functional requirements for our Zoom API one by one:

Availability and reliability#

We ensure the availability of our services by dividing servers according to different roles. For example, the meeting service handles requests to create and update meetings, add participants, and so on, while the media controller handles client requests for managing meeting sessions. By adopting this role-based style, we separate different workflows: if one service goes down, the other can still run normally, making our system resilient to complete outages. Additionally, services and data are replicated across different geographical regions to avoid single points of failure (SPOF). We also use API monitoring and circuit breakers to identify and handle failures as quickly as possible. To manage resources efficiently, we limit concurrent meeting requests based on the account type, and for free users we also cap the maximum meeting duration to avoid overloading the servers.

Improving availability and reliability with separation of workflows

Security#

We use TLS 1.3 for normal communication and to exchange the AES keys used for multimedia transmission. After the key is successfully shared, the connection is upgraded to WebSockets for AES-encrypted data transfer. We implement authentication and authorization using a login mechanism, along with OAuth and OpenID Connect with PKCE flows for third-party interactions (see: the authorization framework). Connecting to the media router requires an access token. Guest (unregistered) participants can also join using an access token, which is only issued when the host accepts their join request.
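As an illustration of the guest flow described above, here is a minimal sketch of access-token issuance and verification. The helper names, token format, and HMAC signing scheme are our assumptions, not Zoom's actual mechanism; a production service would typically use short-lived signed tokens such as JWTs managed in a secrets store.

```python
import hashlib
import hmac
import secrets

# Per-deployment signing key (assumption; a real service would manage
# and rotate this key in a secrets store).
SERVER_KEY = secrets.token_bytes(32)

def issue_guest_token(meeting_id: str, guest_id: str) -> str:
    # Issued only after the host accepts the guest's join request.
    payload = f"{meeting_id}:{guest_id}"
    sig = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_guest_token(token: str) -> bool:
    # The media router checks the token before accepting a connection.
    try:
        meeting_id, guest_id, sig = token.rsplit(":", 2)
    except ValueError:
        return False
    expected = hmac.new(SERVER_KEY, f"{meeting_id}:{guest_id}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

A tampered or malformed token fails verification, so the media router can reject the connection before any media flows.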

Scalability#

Locally distributed media routers make scaling services easier. We also have decoupled media routers and media controllers, which allow us to deploy multiple media routers in an area controlled by a single controller, making this a cost-effective solution. Stateless communication between the conferencing service and the media controller allows efficient resource management during workload peaks.

Point to Ponder

Question

What determines the maximum number of users a service like Zoom can handle in a single meeting?


It is difficult to know the exact participant count that a service can handle. Determining the upper limit of a service depends on many factors (some of which are dynamically changing):

  • Device specifications such as processing power, memory, etc.
  • The number of features provided by the services and the extent to which those features interact.
  • The number of components involved and their capabilities, such as databases, third-party services (if any), etc.
  • Number of active users. For example, a meeting could have a hundred participants with only a few sharing video, versus 40 participants all sharing video and actively collaborating; the latter puts more load on the service than a hundred mostly inactive users.
  • Many other factors, such as rate limiting, data type, average response time, available bandwidth, and so on.

All of these factors together determine the operability of a service with good quality, and we can only estimate (at the design level) a safe number that a service should support based on the service’s SLA. Real services will run empirical tests to come up with a reasonable number. For services like Zoom, it’s recommended to support at least 500 participants per meeting.

Optimization and tradeoffs#

  • The stateful nature of WebSockets can be a scalability issue for our service, but it is unavoidable given the two-way, real-time nature of the service. We can still scale by increasing the number of regional media servers; this is an expensive solution, but there is always some tradeoff.

  • Additionally, since we learned in a previous lesson that MCU costs computation while SFU costs bandwidth, we can take a hybrid approach in which servers act as both. A server can intelligently switch to MCU when bandwidth is limited and shift back to simulcast SFU when network conditions improve. We can also create client meshes within enterprise networks for meetings with a small number of participants. Moreover, we can preallocate resources for scheduled meetings based on the expected number of attendees.

The enhanced architecture of the Zoom meeting service

This gives us the flexibility to either create peer-to-peer communication (for small groups on the same network) or employ media servers where effective in an enterprise network. By adopting this enhanced version, we gain the following benefits:

  • We can independently manage rooms for small groups that are meeting within the enterprise networks and offload some of the work from media servers. With this approach, client devices will interact directly by creating a peer-to-peer connection, and we can always shift to servers when the number of participants exceeds a certain threshold.

  • We can improve the overall user experience by moving from SFU to a hybrid (MCU and SFU) approach that allows the service to adapt to network conditions more effectively.
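The mode-switching logic described above can be sketched as a simple decision function. The thresholds (8 participants for a client mesh, 2 Mbps for MCU fallback) are illustrative assumptions, not values from the lesson:

```python
def select_media_mode(participants: int, bandwidth_mbps: float,
                      same_enterprise_network: bool) -> str:
    """Pick a media topology for a meeting (hypothetical thresholds)."""
    if same_enterprise_network and participants <= 8:
        # Small in-network groups connect peer-to-peer, offloading servers.
        return "p2p-mesh"
    if bandwidth_mbps < 2.0:
        # Limited bandwidth: the server mixes everything into one stream.
        return "mcu"
    # Good network conditions: forward multiple quality layers per sender.
    return "simulcast-sfu"
```

In a real deployment this decision would be re-evaluated as network conditions and participant counts change during the meeting.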

Low latency#

Routing media through locally distributed servers reduces overall user-perceived latency. These servers act as simulcast SFUs, and adaptively upscale or downscale video resolution based on the network conditions. We also deploy media controllers in different geographic regions to facilitate a smooth user experience in joining and controlling meeting sessions. Furthermore, the communication between the client and the media server is based on WebSockets, which is relatively faster than HTTP-based communication, helping us achieve fast bidirectional data flow with low latency.

Achieving Non-Functional Requirements

| Non-Functional Requirements | Approaches |
| --- | --- |
| Availability and reliability | Role-based services; replicate data across different data centers to avoid SPOF; circuit breakers to restrict failures; limit concurrent meeting requests per account type; fallback support to HTTP/2; API monitoring for early fault detection |
| Security | AES-encrypted data transmission; authentication and authorization using a login mechanism; OAuth and OpenID Connect with PKCE flow for third-party interactions; access tokens for guest participation |
| Scalability | Regionally distributed media routers; decoupled media service and meeting service; stateless communication between the meeting service and clients |
| Low latency | Route media through regionally distributed media routers; simulcast SFU to adapt to network conditions; regionally distributed media controllers for swift session startup |

Latency budget#

Latency estimation for our Zoom API involves calculating the latency of the following three events:

  • Joining a meeting

  • Setting up a session

  • Exchanging video clips

As discussed in the back-of-the-envelope calculations for latency, the latency of GET and POST requests is affected by two different parameters. In the case of GET, the average RTT remains the same regardless of the data size (due to the small request size), and the time to download the response varies by 0.4 ms per KB. For POST requests, the RTT grows with the data size by 1.15 ms per KB over the base RTT (the minimum RTT taken by a request with the smallest data size), which is 260 ms.
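These two rules can be expressed as small helper functions. The function and constant names are ours; the values are taken from the lesson:

```python
# Constants from the back-of-the-envelope lesson.
RTT_GET = 70.0          # ms, independent of data size for GET requests
RTT_POST_BASE = 260.0   # ms, base RTT for the smallest POST request
POST_PER_KB = 1.15      # ms added to the POST RTT per KB of request
DOWNLOAD_PER_KB = 0.4   # ms per KB of response downloaded

def get_latency(base_ms: float, response_kb: float) -> float:
    """GET latency: base time + fixed RTT + download time."""
    return base_ms + RTT_GET + DOWNLOAD_PER_KB * response_kb

def post_latency(base_ms: float, request_kb: float) -> float:
    """POST latency: base time + size-dependent RTT + the flat 0.4 ms
    download term the lesson uses for the standard-sized response."""
    return base_ms + (RTT_POST_BASE + POST_PER_KB * request_kb) + 0.4
```

These helpers reproduce the minimum and maximum latencies computed in the sections that follow when given the lesson's base times of 120.5 ms and 201.5 ms.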

Let's discuss each of the above points one by one:

Joining a meeting#

Clients initiate simple GET requests to obtain meeting details. Let's calculate the message size and, using that size, estimate the response time for this request to complete.

Request and response size#

Let's assume that the response size of the request is 4 KB, which contains meeting details such as start_time, agenda, settings, and so on.

$Response\ size = 4\ KB$

Response time#

We can put the response size into the following calculator to estimate the minimum and maximum response times for joining a meeting's waiting room.

Response Time Calculator to Join a Meeting

| Parameter | Value |
| --- | --- |
| Response size | 4 KB |
| Minimum latency | 192.1 ms |
| Maximum latency | 273.1 ms |
| Minimum response time | 196.1 ms |
| Maximum response time | 277.1 ms |

Assuming the response size is 4 KB, the latency is calculated by:

$Time_{latency\_min} = Time_{base\_min} + RTT_{get} + 0.4 \times size\ of\ response\ (KB) = 120.5 + 70 + 0.4 \times 4 = 192.1\ ms$

$Time_{latency\_max} = Time_{base\_max} + RTT_{get} + 0.4 \times size\ of\ response\ (KB) = 201.5 + 70 + 0.4 \times 4 = 273.1\ ms$

Similarly, the response time is calculated using the following equation:

$Time_{Response} = Time_{latency} + Time_{processing}$

Now, for minimum response time, we use the minimum values of base time and processing time:

$Time_{Response\_min} = Time_{latency\_min} + Time_{processing\_min} = 192.1\ ms + 4\ ms = 196.1\ ms$

Now, for maximum response time, we use the maximum values of base time and processing time:

$Time_{Response\_max} = Time_{latency\_max} + Time_{processing\_max} = 273.1\ ms + 4\ ms = 277.1\ ms$
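The join-meeting numbers above can be reproduced with plain arithmetic, using the values assumed in this lesson:

```python
# Values assumed in the lesson for the join-meeting GET request.
base_min, base_max = 120.5, 201.5        # ms, min/max base time
rtt_get, per_kb, size_kb = 70, 0.4, 4    # GET RTT, download cost, response size
processing = 4                           # ms, server processing time

latency_min = base_min + rtt_get + per_kb * size_kb    # 192.1 ms
latency_max = base_max + rtt_get + per_kb * size_kb    # 273.1 ms
response_min = latency_min + processing                # 196.1 ms
response_max = latency_max + processing                # 277.1 ms
```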

Setting up a session#

We use standard HTTP POST requests to store client sessions on the media controller server. Let's also calculate a rough estimate of the response time for storing a user session successfully.

Request and response size#

Let's assume that, on average, the session description has a size of 2 KB and contains information about media type, media encoding, required bandwidth, and so on. The estimated response size is also 2 KB, containing the access token and the configuration ID created on the media controller. We know from previous lessons that a 2 KB response for a POST request is standard, so we only need the request size to calculate the response time.

$Request\ size = 2\ KB$

Response time#

Let's put the request size in the following calculator to get the minimum and maximum response times for storing a session on the media controller:

Response Time Calculator to Share a Session Description

| Parameter | Value |
| --- | --- |
| Request size | 2 KB |
| Minimum latency | 383.2 ms |
| Maximum latency | 464.2 ms |
| Minimum response time | 387.2 ms |
| Maximum response time | 468.2 ms |

Assuming the request size is 2 KB:

$Time_{latency} = Time_{base} + RTT_{post} + Download$

$RTT_{post} = RTT_{base} + 1.15 \times Size = 260\ ms + 1.15\ ms \times 2\ KB$

$Time_{latency\_min} = Time_{base\_min} + (RTT_{base} + 1.15 \times size\ of\ request\ (KB)) + 0.4 = 120.5 + (260 + 1.15 \times 2) + 0.4 = 383.2\ ms$

$Time_{latency\_max} = Time_{base\_max} + (RTT_{base} + 1.15 \times size\ of\ request\ (KB)) + 0.4 = 201.5 + (260 + 1.15 \times 2) + 0.4 = 464.2\ ms$

Similarly, the response time is calculated as follows:

$Time_{Response\_min} = Time_{latency\_min} + Time_{processing\_min} = 383.2\ ms + 4\ ms = 387.2\ ms$

$Time_{Response\_max} = Time_{latency\_max} + Time_{processing\_max} = 464.2\ ms + 4\ ms = 468.2\ ms$
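Likewise, the session-setup numbers can be checked in a few lines:

```python
# Values assumed in the lesson for the session-setup POST request.
base_min, base_max = 120.5, 201.5           # ms, min/max base time
rtt_base, per_kb, size_kb = 260, 1.15, 2    # base POST RTT, per-KB cost, request size
download, processing = 0.4, 4               # ms, flat download term and processing time

latency_min = base_min + (rtt_base + per_kb * size_kb) + download   # 383.2 ms
latency_max = base_max + (rtt_base + per_kb * size_kb) + download   # 464.2 ms
response_min = latency_min + processing                             # 387.2 ms
response_max = latency_max + processing                             # 468.2 ms
```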

Exchanging video clips#

This event consists of two steps:

  • The HTTP upgrade request

  • The data exchange on WebSockets

As we know from our previous discussions, upgrade requests only exchange HTTP headers. Based on our custom calculations, the WebSockets upgrade takes a maximum of 275.9 ms (details of this calculation can be viewed here). Moreover, the time it takes to upgrade to WebSockets does not matter much, because it only happens once, at the start of the meeting.

Now, let's move on to the next step and take an example where a person sends a full HD stream of 1080p resolution and receives a similar HD stream, along with ten low-quality streams of 144p thumbnail-sized videos. Let's calculate the latency of a one-second clip sent and received by the client.

Message size#

Assuming the encoding used to send and receive the stream is high-quality H264, and the devices are configured to send the stream at 30fps, the estimated size of one second of video for such a setup is as follows:

$Message\ size_{outgoing} = 614.4\ KB$

Assuming the size of each 144p thumbnail-sized video is 12 KB in the same configuration, and the one-second 1080p clip size is the same as above, we can calculate the incoming message size as follows:

$Message\ size_{incoming} = 614.4\ KB + (10 \times 12)\ KB = 734.4\ KB$

Response time#

Knowing that data is transferred over an already established WebSockets connection, we only need to calculate the transfer time using the following formula:

$Latency = Message\ size \times 0.4 + Base\ propagation\ delay$

$Latency_{outgoing} = 614.4 \times 0.4 + 35 = 280.76\ ms$

$Latency_{incoming} = 734.4 \times 0.4 + 35 = 328.76\ ms$

Note: Here, we can ignore other factors such as base time, request compile time, and so on because WebSockets do not follow the request-response model.

Let's use the following calculator to add the forwarding time taken by our simulcast SFU server and get the overall estimated response time:

User Perceived Latency Calculator for Exchanging Video Clips

| Parameter | Value |
| --- | --- |
| Chunk size | 614.4 KB |
| Thumbnail-sized video | 12 KB |
| No. of thumbnail-sized videos | 10 |
| Outgoing stream latency | 280.76 ms |
| Incoming stream latency | 328.76 ms |
| User-perceived latency | 613.52 ms |
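The clip-exchange numbers can be reproduced as follows; the 4 ms SFU forwarding time is an assumption inferred from the calculator's 613.52 ms total:

```python
# Values from the lesson: 0.4 ms/KB transfer cost, 35 ms base propagation
# delay, a 614.4 KB one-second 1080p clip, and ten 12 KB 144p thumbnails.
hd_kb, thumb_kb, n_thumbs = 614.4, 12, 10
per_kb, propagation = 0.4, 35     # ms/KB and ms
forwarding = 4                    # ms, assumed SFU forwarding time

outgoing = hd_kb * per_kb + propagation                          # 280.76 ms
incoming = (hd_kb + n_thumbs * thumb_kb) * per_kb + propagation  # 328.76 ms
user_perceived = outgoing + incoming + forwarding                # 613.52 ms
```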

The overall latency budget for exchanging video clips using our Zoom API is summarized in the illustration below:

Response time for exchanging video clips of the Zoom API

Note: The above response times are calculated using formulas derived for HTTP requests. In practice, WebSockets are more lightweight than HTTP because their headers are relatively small, which further reduces the response time of each request. Furthermore, these calculations assume distant client-server communication, whereas our design takes server proximity into account. Therefore, while these estimates are on the high side, actual values will be considerably lower when served from nearby locations.

Summary#

In this lesson, we discussed how our API meets non-functional requirements. We also learned how to further improve the performance and scalability of the service. Finally, we estimated the average time for the expected response from the API.
